Abstract
Background In recent years, an increasing number of health chatbots have been published in app
stores and described in the research literature. Given the sensitive data they process
and the care settings for which they are developed, evaluation is essential to avoid
harm to users. However, evaluations of these systems are reported inconsistently and
without a standardized set of evaluation metrics. The lack of standards in health
chatbot evaluation prevents comparison between systems, which may hamper acceptability
because their reliability remains unclear.
Objectives The objective of this paper is to take an important step toward developing a health-specific
chatbot evaluation framework by finding consensus on relevant metrics.
Methods We used an adapted Delphi study design to verify and select potential metrics that
we initially retrieved from a scoping review. Over three survey rounds, we invited researchers,
health professionals, and health informaticians to score each metric for inclusion in the
final evaluation framework. We distinguished metrics rated as relevant with
high, moderate, or low consensus. The initial set comprised 26 metrics
(categorized as global metrics, metrics related to response generation, metrics related
to response understanding, and metrics related to aesthetics).
Results Twenty-eight experts joined the first round and 22 (79%) remained through the third round.
Twenty-four metrics achieved high consensus and three metrics achieved moderate consensus.
The core set for our framework comprises mainly global metrics (e.g., ease of use,
security, content accuracy), metrics related to response generation (e.g., appropriateness
of responses), and metrics related to response understanding. Metrics on aesthetics (font
type and size, color) were less well agreed upon; only moderate or low consensus was
achieved for those metrics.
Conclusion The results indicate that experts largely agree on the proposed metrics and that the consensus
set is broad. This implies that health chatbot evaluation must be multifaceted to
ensure acceptability.
Keywords
health chatbots - conversational agents - performance measures - evaluation framework
- Delphi study